👌 The new columns “Quarter” and “Sales” are variables, each with its own column.
Plotting the longer data
This is a common transformation: data entry is often easier in a wider format, but the tools we use in programming often require a longer format.
tidy1
Try plotting the transformed data, tidy1:
Map x to Quarter, y to Sales, and group to Store.
ggplot(tidy1) + aes(x = Quarter,
y = Sales,
group = Store) +
geom_line()
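The wide-to-long step and the plot can be sketched end to end as follows. The table `untidy_example` and its values are made up for illustration and are not the class dataset; only the column names Store, Quarter, and Sales come from the slides.

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide table: one row per store, one column per quarter.
untidy_example = data.frame(
  Store = c("A", "B", "C"),
  Q1 = c(55, 43, 70),
  Q2 = c(51, 47, 65),
  Q3 = c(60, 40, 72),
  Q4 = c(62, 49, 80)
)

# Stack the four quarter columns into a Quarter/Sales pair of columns.
tidy_example = pivot_longer(untidy_example,
                            cols = c("Q1", "Q2", "Q3", "Q4"),
                            names_to = "Quarter",
                            values_to = "Sales")

# Draw one line per store across the quarters.
sales_plot = ggplot(tidy_example) +
  aes(x = Quarter, y = Sales, group = Store) +
  geom_line()
sales_plot
```

Each store contributes one row per quarter after the pivot, which is the shape that the x, y, and group mappings expect.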
Thinking about pivot_longer()
What columns in this dataset could be combined into one column?
relig_income_sm[1:5,]
You’ll have the opportunity to work on this at the end of class.
Situation: One column maps to multiple variables
untidy2
What’s the relationship between KRAS mutation and KRAS expression?
😩 Is an observation a sample’s gene type?
😩 The variables we want, KRAS_mutation and KRAS_expression, are in rows. The current columns contain multiple types of info: gene contains both mutations and expression, and value contains both gene expression and mutational status.
pivot_wider()
Solution: make the data into a wider format. Tidyexplain
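A minimal sketch of the widening step, assuming a long table shaped like the one described above. The table `untidy_example2`, the `sample` column, and the 0/1 coding of mutation status are assumptions made for illustration, not the contents of untidy2.

```r
library(tidyr)

# Hypothetical long table: each sample has one row per gene measurement.
# Mutation status is coded 0/1 here so that value stays numeric.
untidy_example2 = data.frame(
  sample = c("s1", "s1", "s2", "s2", "s3", "s3"),
  gene = rep(c("KRAS_mutation", "KRAS_expression"), times = 3),
  value = c(1, 2.3, 0, 1.1, 1, 3.0)
)

# Spread the gene labels into column names and fill them from value.
tidy_example2 = pivot_wider(untidy_example2,
                            names_from = "gene",
                            values_from = "value")
tidy_example2
```

After the pivot, each sample is a single row with KRAS_mutation and KRAS_expression as separate columns, so the two can be compared directly.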
Situation: Multiple variables are stored in a single column
untidy3
😩 Each cell in the rate column contains multiple values.
separate()
Solution: Let’s separate it:
Before
untidy3
After
tidy3 = separate(untidy3, col = "rate", into = c("count", "population"), sep = "/")
tidy3
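One detail worth checking after a split like this: by default separate() leaves the new columns as character strings. The sketch below uses a made-up table (`untidy_example3`, with a hypothetical `country` column) and the `convert = TRUE` argument so the pieces come back as numbers and can be used in arithmetic.

```r
library(tidyr)

# Hypothetical table where rate packs "count/population" into one cell.
untidy_example3 = data.frame(
  country = c("X", "Y"),
  rate = c("10/1000", "25/5000")
)

# Split rate at "/" and convert the resulting pieces to numeric types.
tidy_example3 = separate(untidy_example3,
                         col = "rate",
                         into = c("count", "population"),
                         sep = "/",
                         convert = TRUE)

# With numeric columns, the rate can be recomputed as a proportion.
tidy_example3$per_capita = tidy_example3$count / tidy_example3$population
tidy_example3
```

Without `convert = TRUE`, the division on the last line would fail because count and population would still be text.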
Subjectivity in Tidy data
We have looked at clear cases of when a dataset is Tidy. In reality, the Tidy state depends on what we call variables and observations for the analysis we want to conduct.
Tools such as ggplot require a precise definition of our variables, so planning ahead how we want to use these tools clarifies what we call variables and observations.
Tip: think about what you want to do with the data, and work backwards. That will help you identify whether the data is tidy or not.
Challenge
What analysis would you want to do with this dataset, and what kind of transformation would you do to get it Tidy?